Phonetic differences between affirmative and feedback head nods in German Sign Language (DGS): A pose estimation study

This study investigates head nods in natural dyadic German Sign Language (DGS) interaction, with the aim of determining whether head nods serving different functions vary in their phonetic characteristics. Earlier research on spoken and sign language interaction has revealed that head nods vary in the form of the movement. However, most claims about the phonetic properties of head nods have been based on manual annotation without reference to naturalistic text types, and the head nods produced by the addressee have been largely ignored. There is a lack of detailed information about the phonetic properties of the addressee’s head nods and their interaction with manual cues in DGS as well as in other sign languages, and the existence of a form-function relationship of head nods remains uncertain. We hypothesize that head nods functioning in the context of affirmation differ from those signaling feedback in their form and in their co-occurrence with manual items. To test the hypothesis, we apply OpenPose, a computer vision toolkit, to extract head nod measurements from video recordings and examine head nods in terms of their duration, amplitude and velocity. We describe the basic phonetic properties of head nods in DGS and their interaction with manual items in naturalistic corpus data. Our results show that phonetic properties of affirmative nods differ from those of feedback nods. Feedback nods appear to be on average slower in production and smaller in amplitude than affirmative nods, and they are commonly produced without a co-occurring manual element. We attribute the variations in phonetic properties to the distinct roles these cues fulfill in the turn-taking system. This research underlines the importance of non-manual cues in shaping the turn-taking system of sign languages, establishing links between research fields such as sign language linguistics, conversational analysis, quantitative linguistics and computer vision.
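
As a rough illustration of the three measurements named above, the following sketch derives duration, amplitude and mean velocity from the vertical trajectory of one head keypoint over the frames of a single nod. The helper name `measure_nod`, the input format, and the definitions themselves (amplitude as vertical range, velocity as vertical path length divided by duration) are assumptions made for illustration and are not necessarily the definitions used in the study.

```python
from dataclasses import dataclass


@dataclass
class NodMeasures:
    duration_s: float     # length of the nod in seconds
    amplitude_px: float   # vertical range of the keypoint in pixels
    mean_velocity: float  # vertical pixels travelled per second


def measure_nod(y_positions, fps):
    """Summarise one nod from per-frame vertical keypoint positions.

    Hypothetical helper: the definitions below are illustrative assumptions.
    """
    if len(y_positions) < 2:
        raise ValueError("need at least two frames")
    # N frames span N - 1 frame intervals.
    duration = (len(y_positions) - 1) / fps
    amplitude = max(y_positions) - min(y_positions)
    # Total vertical distance travelled, summed over consecutive frames.
    path = sum(abs(b - a) for a, b in zip(y_positions, y_positions[1:]))
    return NodMeasures(duration, amplitude, path / duration)
```

Under these assumptions, a nod that covers the same vertical range in less time yields a higher mean velocity, which is the kind of contrast the comparison between nod types relies on.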

Answer: All images in the article depicting persons are taken from the Public DGS Corpus, the dataset that this study is based on. Its license conditions permit the use of video and image materials in linguistic research studies and publications. These conditions are in line with the informed consent that the dataset creators received from participants. The consent forms are not publicly available, but the following documentation clarifies that our use of the data is covered: https://doi.org/10.25592/uhhfdm.1745

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Answer:
The following 4 entries in the reference list featured broken DOIs, which are now fixed: 1. Napier J, Carmichael A, Wiltshire A. Look-Pause-Nod: A linguistic case study of a Deaf professional and interpreters working together. In: Hauser PC, Finch KL, Hauser AB, editors. Deaf professionals and designated interpreters: A new paradigm.
We didn't make any changes to the rest of the references. None of the cited papers are retracted or in the retraction process.

II. Author's notes
During the revision process we noticed that our code and annotation files included some discrepancies in the number of occurrences of different types of nods. The discrepancies were minimal, adding 1 or 2 values to some categories. We have eliminated them in the current submission version. This is why the current version features some changes to the values and measures reported previously. The values reported previously were consistent with the code, but not with the ELAN annotation files. These changes are very slight and do not change our results or our narrative in any way. They did, however, require changes to our figures (no. 5, 6, 7, 8, 9, 10, 11, 14) and tables (no. 4 and 5). The current version of the manuscript features the new tables and figures, which are uploaded together with the current submission. The Zenodo link in the "supporting information" section now features the most current versions of the Python and R code as well as the ELAN files. All of those resources yield numbers that are consistent with the values reported in the paper.
We also noticed that the red line mentioned in the description of Figure 3 is not visible in the actual illustration. The description was updated to match the illustration.

III. Answers to the comments made by Reviewer #1:
General assessment by Reviewer #1: The only thing I found a bit hard to follow while reading the manuscript was that the materials used in the study are described before the annotations are explained. So, I already knew before that there would be two types of head nods because this requirement led the authors to actually include additional data, but only in the section after the "materials" section there were details on how these annotations were created in the first place and how types of head nods were distinguished. To me it would make more sense to start with the criteria used for annotations, then describe the materials in the next section (including the information as to why the dataset had to be expanded in an unanticipated manner, i.e. due to the imbalance in observations), and then continue with the pose estimation section.

Answer:
We have changed the order of those two sections. We start with the criteria used for annotations and then describe the DGS corpus data in the next section before we continue with the pose estimation. We have also made some minor adjustments to the text to better align with the new structure. This mostly involves removing a reference to data from "Annotation" and restructuring "Data" into two subsections: "Source material", where we describe the corpus, and "Annotated material", where we describe what data we annotated and how and why we split it up.
Reviewer #1: Focusing on the section about data selection, the general description as to how data was selected initially and how the dataset was then expanded is a bit sloppy and hard to follow. I understand that the authors took this step due to the uneven distribution of the types of head nods in the initial dataset, but this choice also makes the distribution of both types of head nods in the dataset meaningless (which should be clearly pointed out). I wonder if a preregistration of the desired annotations and minimum required data points for conducting meaningful statistical analyses would have helped the authors here to tell their story more elegantly, instead of more or less randomly selecting video files and ending up with a very uneven distribution of head nod types and then adding just one type of nod data to be able to do statistics? Because the authors are interested in the phonetic differences between nods their approach is not a problem. But maybe preregistration of desired minimum sample sizes per type, etc. or similar is something to consider for future work?

Answers:
We have revised the general overview about the initial data selection process and the subsequent expansion of the dataset to enhance clarity and coherence.
The point about the distribution of both types of head nods in the dataset being meaningless is a valid one. We realize that, because most of the feedback nods come from a different part of the dataset than the affirmative nods, we cannot make valid statements about the distribution or frequency of different types of nods in the data, and we now point this limitation out clearly. However, as also highlighted by Reviewer #1, our main aim was not to discuss the frequency of different types of head nods but their phonetic properties, and for this we needed an augmented set of affirmation nods. There might have been a more elegant way of achieving this; we appreciate this being raised and will take it into consideration in future research by applying preregistration. In the follow-up study, we preregister the desired annotations and rely on the minimum required data points for conducting meaningful statistical analyses (which is 20 per cell according to the general recommendation), instead of randomly selecting video files and ending up with a very uneven distribution.
Reviewer #1: Again relating to the annotations, a minor issue I noticed was that the authors provide inter-rater agreement for the length of annotations, but not for their coding of function and turn taking. I wondered why that was not done and/or why this information is not provided given that, for example, around 150 out of about 650 annotations of head nod function were coded as "other"?
Answer: Thank you for this comment. It is true that we do not provide inter-annotator agreement (IAA) measures for the coding of nod function and turn-taking (TT) activity. This is because, for function and TT coding, we did considerably more joint work before progressing to separate annotations, and we therefore expect our agreement to be higher. Also, IAA measures wouldn't work well here, as the annotations were not done blindly (that is, we discussed cases we were unsure about, which we did not do for form coding before calculating IAA). What is more, the length of the function and TT annotations was exactly the same as for form (as they were produced on child tiers and the spans of the form annotations were copied exactly), so ELAN's modified kappa, which takes into consideration both the overlap of annotations and their values, would yield an artificially high result. Lastly, on each of these tiers we only used two annotation values, which would also artificially strengthen our kappas. We have added this explanation to the manuscript.
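
For reference, plain Cohen's kappa over categorical annotation values can be sketched as follows. This is the textbook chance-corrected agreement measure, not ELAN's modified kappa, which additionally takes the temporal overlap of annotations into account; the function name and the example labels are illustrative assumptions and are not taken from our annotation files.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which both annotators chose the same value.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four nods from two annotators.
rater_1 = ["feedback", "feedback", "affirmation", "affirmation"]
rater_2 = ["feedback", "feedback", "affirmation", "feedback"]
kappa = cohens_kappa(rater_1, rater_2)
```

Because this measure assumes independent labelling of a fixed set of items, it does not capture the effects described above (jointly discussed cases and copied annotation spans).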
Reviewer #1: Lastly, another minor concern I had while reading was that sometimes I had the feeling that the manuscript currently lacks concision and, in parts, is written in a too narrative style with many fundamental pieces of information repeated again and again in different sections. I have provided some notes for parts that felt highly repetitive as part of my line-by-line comments below and hope that it will help the authors to make their manuscript more concise. However, please consider that I did not list all parts where I had the feeling that the text could be more concise or felt repetitive.
Answer: Thank you for bringing this up. We have gone through the whole text of the manuscript and made it more concise by removing passages that were written in an overly narrative style or that repeated information across sections.
Reviewer #1: I have also listed other small and/or specific comments below with the aim to help the authors improve their manuscript before resubmission. Provided that the authors are willing to make their text more concise and incorporate my comments wherever they see fit, I recommend that such a revised version of the manuscript should be accepted for publication (in that case, I would not need to see such a revised version but leave that at the discretion of the editor).
Answer: Thank you so much for these comments; they are immensely useful and we appreciate your time. We are very much willing to make our text more concise and have incorporated all of the comments provided by Reviewer #1. All of the changes are marked in orange in the revised version.
Line-by-line comments of the Reviewer #1: Abstract: I'd spell out "German Sign Language" and give the abbreviation "DGS" in brackets, but I guess that's a matter of taste and/or requirements of the journal's stylesheet.
Answer: We implemented this change in the manuscript, both in the abstract and in the section "The current study".
ln. 2: I am not a native speaker, but my intuition would be to put this into plural, so make it "head nods are …" and "interactions"? But if the authors are native speakers or have consulted with a native speaker, please disregard this comment.
Answer: Thank you. We have checked it with a native speaker, who suggested the following wording: Head nod is one of the most commonly produced bodily signals in interaction.
ln. 220: As the abbreviation "DGS" has already been introduced above it should be used consistently.

Answer:
We generally agree with this comment; however, in this particular context using the abbreviation would yield a weird and circular reading: "The DGS Corpus is an annotated reference corpus of DGS [...]". We would therefore like to stick with the full name here.
ln. 220-232: Is all this background information on the corpus really relevant, given that you use only (parts) of the publicly available data anyway?
Answer: Description of the Corpus has been tightened in the manuscript.
ln. 245-247: How were these samples drawn? Did the authors employ some kind of random sampling procedure or similar? If not, why not?
Answer: Thank you for this valid point. For this study we didn't do any sampling, but we will do it in future studies. We see head nodding as such a pervasive phenomenon (appearing across all genders, age groups, etc.) that we didn't anticipate much variation other than variation based on personal signing style (and we therefore included many individual participants and text types in the sample).
ln. 306: The authors write "two tags were identified as separate movements if the offset of one tag and the onset of the next were at least 300ms apart". Does that imply that annotations occurring closer together in time were considered to be part of the "same" head nod?
Answer: Yes, this reading is correct. We added a sentence in the manuscript to clarify this.
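
The rule confirmed here can be sketched in a few lines: tags whose gap is shorter than 300 ms are merged into one movement, while tags at least 300 ms apart stay separate. The function name and the representation of tags as (onset, offset) pairs in milliseconds are assumptions made for illustration; this is not an excerpt from the code published with the manuscript.

```python
def merge_close_tags(tags, min_gap_ms=300):
    """Merge annotation tags whose gap is shorter than min_gap_ms.

    tags: iterable of (onset_ms, offset_ms) pairs.
    Returns a list of merged (onset_ms, offset_ms) pairs sorted by onset.
    """
    merged = []
    for onset, offset in sorted(tags):
        if merged and onset - merged[-1][1] < min_gap_ms:
            # Gap below the threshold: extend the previous movement.
            prev_onset, prev_offset = merged[-1]
            merged[-1] = (prev_onset, max(prev_offset, offset))
        else:
            # Gap of at least min_gap_ms: start a new movement.
            merged.append((onset, offset))
    return merged


# The first two tags are 200 ms apart and are merged;
# the third starts exactly 300 ms later and stays separate.
movements = merge_close_tags([(0, 400), (600, 900), (1200, 1500)])
```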
ln. 391-399: I feel like I already know this from your extensive introduction and background, does it have to be repeated here?
Answer: We deleted the repeated information.
ln. 408-415: I feel like this entire paragraph could be half a sentence as part of the previous one making it clear that these techniques are limited to 2D-data.
Answer: We deleted the repeated information.
ln. 420-427: This also reads like a repetition of information already given in the introduction. You already explained what you're going to do in principle, so I think it would be fine to limit yourself to the actual methods and approach that you employed.

Answer:
We deleted the repeated information here as well.
Answer (regarding the suggestion for Figure 4): In principle this is a very nice idea, but we think it would be difficult to make it clear which exact nose keypoint is used without having to zoom in considerably, which would then result in a very pixelated image. Instead, we have highlighted the used keypoints in the previous version of the image.
ln. 489-494: I find this discussion a bit weird, what are these percentages supposed to tell the reader? You stated in the "materials" section that feedback nods were extremely overrepresented in your original sample, which is why you added additional data in which you only annotated affirmative nods. The relative percentages here then do not really provide any information to the reader other than how often these nods occurred in your selected subset of data (which doesn't follow any objective selection criteria but was specifically expanded to kind of balance the occurrence of the two types of nods). In its current form this kind of reads to me as if these percentages would tell the reader something meaningful about the occurrence of these types of nods "in the wild", whereas that is not the case because a part of the dataset was not annotated for both types of head nod.
Answer: Yes, you are right. We still think that this information is important for the reader. We have kept it and added a point to this discussion to make explicit in the manuscript that these percentages only give the reader an overview of which nods we analyze in our selected subset of data, and that these numbers do not represent natural occurrence, since the dataset was specifically expanded to balance the occurrence of the two types of nods.
ln. 509: If I read the table correctly, what you are reporting here in the running text is the median not the mean?
Answer: That is right, this is something we overlooked. This has now been fixed: in the text we report the means, and the medians are reported in the table.
ln. 529-536: This is very repetitive, as it has already been explained above at least once. Maybe cut?
Answer: We deleted the repeated information here.
Table 4: Are the t-tests really meaningful here? That is, is the assumption of normality met? Given that you also ran non-parametric tests I assume that it's not the case. So, what is the reason for reporting them, respectively reporting both?
Answer: We included both since we expected these to be distributed normally, even though they seem to deviate from normal.We added a line on this in the manuscript (line 569 ff, line 711 ff).
ln. 618-632: What is the relationship between the co-occurrence (or lack thereof) of manual signs as well as mouthings with the signers' turn-taking behavior? That is, it would be interesting to see what is the relationship between affirmative nods which mostly are not accompanied by manual lexical elements to turn taking; perhaps there is a relationship between presence of manual signs and initiating a new turn or similar? I don't want to force the authors to explore this, but it might be food for thought.
Answer: Thank you for this interesting direction of study. After receiving this comment, we wanted to report this value as well, so we checked the co-occurrence of manual signals with nods with different TT functions in our data. See the table below for quantitative results. It turned out that the pattern follows the one observed for the co-articulation of manual signs with nods of different functions. Therefore, we do not think this finding would add significant depth to the analysis. However, it is something we are definitely interested in and will pursue in our future studies.
Answer: Rephrased in the manuscript to "In line with our hypothesis, our results reveal significant differences in […]".
ln. 681-682: Yes, but is this really surprising given that you yourself already cite studies with hearing speakers of English that apparently show the same distinction in head nods (ln. 706)? Personally, I would not phrase all these findings as specific to DGS but rather discuss them from a more general perspective of communicative interactions which integrate audiovisual information in hearing speakers and layers different and partially simultaneously occurring visual information in deaf signers. That is, I don't really see a reason to assume any differences here a priori, instead signed and spoken interactions just may simply constitute different use cases of the same pragmatic behaviours and signals.
Answer: That is a possible explanation, and an elegant and tempting one, but one for which we do not yet have empirical evidence. Comparison to spoken languages might be misleading here, as head movements can have grammatical functions in signed languages (SLs), and therefore equivalence to spoken languages (SpLs) in how they are performed is not necessarily a given. On the other hand, comparison to other studies of signed languages is difficult, because not many such studies exist so far. However, in our future research we are planning to develop such a multilingual and multimodal typological perspective by adding more languages of different kinds to our data sample.
ln. 707-744: This seems a bit too extensive, everything was already stated in detail in the running text before. So, I would cut that a bit short really just pointing to all potential shortcomings. Similarly, this is a matter of style once again, but I would not end the manuscript with the section "Limitations". Either add a short concluding paragraph after the limitations section or rearrange things a little bit in some other way. After all you did good work, so that should also be the final message with which to leave the reader (imho).
Answer: Thank you. Good point. We shortened the section "Limitations" and added a short final section "Conclusion and future work" in order not to end the manuscript with "Limitations" ☺.

Reviewer #2:
The feedback head nod (especially passive recipiency signals) must be considered as extralinguistic, as they are not grammatical participants of the signing text, or in the other words, they are not linguistically functioning elements on the morpho-syntactic level. I would recommend to add some linguistic discussion about this issue in order to outline the meaning of the presented research and to improve its theoretical frame.
Answer: In our view the answer to this comment depends on the definition of what should be regarded as extralinguistic overall. The feedback signals (whether they are head nods or any other manual or non-manual cues, vocalizations like mhm, lexical words like yes, or signs like YES) might not be relevant on the morpho-syntactic level, but does that make them "not linguistically functioning elements"? We do not consider things 'linguistic' only if they are relevant on the morpho-syntactic level, and we therefore disagree with Reviewer #2 on this point. We hold the perspective that feedback elements play a significant linguistic role in interactional communication, and we thus regard them as genuinely linguistic.
Reviewer #2: There is a lack of wide typological picture, which could lead the authors to some theoretical discussion. Although the authors mentioned about it in subchapter 'Limitations of this study and future work', saying that crosslinguistic and cross-modal differences in terms of phonetic properties of head nods might be expected and they intend to address this in future research.
Answer: Thank you for this point. Presenting a cross-modal perspective currently poses a challenge. Overall comparisons to head nods in spoken languages might be misleading at this point, as head movements are known to have grammatical functions in signed languages, and therefore equivalence to spoken languages in how they are performed is not necessarily a given. On the other hand, comparison to other studies of signed languages is difficult, because not many such studies exist so far. However, in our future research we are planning to develop such a multilingual and multimodal typological perspective by adding more languages of different kinds to our data sample.

Figure 4: This is only a suggestion: Instead of reproducing these default images, it might be nicer to create a figure that shows the model and keypoints on top of a representative frame of your data.But please disregard if you consider this unnecessary or too much work.
Answer: Distribution of MG only is not possible to describe at the current stage of this research. Further annotations would be required. The MG annotations reported on in the paper relied on the annotations pre-existing in the DGS Corpus, which are not fully complete and do not differentiate mouthings and MGs, as we note in lines 614-617 and in footnote 5, which reads: "Either mouthing or mouth gesture. See [90, 92] for detailed information on these two mouth movement types. The Public DGS Corpus tier for mouth movement annotation included tags that differentiate between mouthing and mouth gestures. Mouthings are transcribed in full, while all mouth gestures are annotated with a single tag ([MG]) and are not differentiated in further subtypes."